First, we import the libraries used throughout this project.
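The chunks below call functions from several packages. A plausible load block, inferred from the functions used later (the exact package set is an assumption, since the original library calls are not shown):

library(tidyverse)   # dplyr verbs, pipes, and ggplot2
library(GGally)      # ggpairs()
library(class)       # knn()
library(caret)       # confusionMatrix()
library(e1071)       # naiveBayes()
library(imbalance)   # oversample() with ADASYN
library(fastDummies) # dummy_cols()
library(readxl)      # read_excel()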
### Data Imports
employeeData = read.csv("CaseStudy2-data.csv", sep = ",")
employee_original = employeeData # make copy of original dataset
head(employeeData)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## 6 4 3 80 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## 6 9 4 2 9
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
## 6 7 1 7
In this section, I will check for nulls and encode character columns as numeric variables. There is nothing to clean here, as the dataset contains no nulls and the data integrity is high.
### Check for Missing Values
sum(is.na(employeeData))
## [1] 0
# Make OverTime & Attrition columns binary
employeeData$cleanOverTime = ifelse(employeeData$OverTime=="Yes",1,0)
employeeData$cleanAttrition = ifelse(employeeData$Attrition=="Yes",1,0)
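As a quick sanity check (an addition, not part of the original analysis), a cross-tabulation confirms that each “Yes” maps to 1 and each “No” to 0:

# Illustrative check: rows are the original labels, columns the binary encoding
table(employeeData$OverTime, employeeData$cleanOverTime)
table(employeeData$Attrition, employeeData$cleanAttrition)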
First, we start by looking at the attrition count. Our bar chart shows that the “No” attrition outnumbers the “Yes” by nearly 7 to 1.
Some variables that should be included are Monthly Income, Overtime, Job Level, and Job Role.
Our Attrition vs. Monthly Income histogram shows that as an individual's salary increases, they become less likely to quit. Next, we find that those who work overtime are more likely to quit: a little over 25% of those who work overtime do, which makes sense since they are more prone to stress. Job Role appears to be a big factor as well; almost 50% of those who work as sales representatives quit, and about 25% of those who work in human resources do.
Furthermore, from our pair plots, we find that the length of time someone has worked is somewhat correlated with attrition as well. The correlations hover between 0.59 and 0.78: positive and fairly strong, though not very strong. However, it is enough to make an impact.
#### Exploratory Data Analysis
#### Attrition EDA
employeeData %>% ggplot(aes(x=Attrition,fill=Attrition)) +
geom_bar()+
ggtitle("Attrition Count") +
xlab("Attrition")+ylab("Count")
### Percentage Compares for Job Role
ggplot(employeeData, aes(x = JobRole, fill = Attrition)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
ggtitle("Attrition Rate By Job Role")
### Attrition Vs. Job Satisfaction
employeeData %>%
ggplot(aes(x=JobSatisfaction,fill=Attrition))+
geom_bar()+
ggtitle("Attrition Vs. Job Satisfaction")
### Attrition Vs. MonthlyIncome
employeeData %>% ggplot(aes(x=MonthlyIncome,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. MonthlyIncome")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. DistanceFromHome
employeeData %>%
ggplot(aes(x=DistanceFromHome,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. DistanceFromHome")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. MonthlyRate
employeeData %>%
ggplot(aes(x=MonthlyRate,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. MonthlyRate")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. OverTime
employeeData %>%
ggplot(aes(x=OverTime,fill=Attrition))+
geom_bar(position="fill")+ggtitle("Attrition Vs. Overtime")+
scale_y_continuous(labels = scales::percent)
### Attrition Vs. Years Since Last Promotion
employeeData %>%
ggplot(aes(x=YearsSinceLastPromotion,fill=Attrition))+
geom_bar(position="fill")+ggtitle("Attrition Vs. Years Since Last Promotion") +
scale_y_continuous(labels = scales::percent)
### Attrition Vs. Salary Hike - not predictive
employeeData %>%
ggplot(aes(x=PercentSalaryHike,fill=Attrition))+
geom_histogram(position="fill")+
ggtitle("Attrition Vs. Percent Salary Hike")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 30 rows containing missing values (geom_bar).
### Attrition Vs. Age
employeeData %>%
ggplot(aes(x=Age,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition vs Marital Status
employeeData %>%
ggplot(aes(x=MaritalStatus,fill=Attrition))+
geom_bar(position="fill")+
ggtitle("Attrition Vs. Marital Status")
### Attrition Vs. PercentSalaryHike
employeeData %>%
ggplot(aes(x=PercentSalaryHike,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. PercentSalaryHike")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. WorkLifeBalance
employeeData %>%
ggplot(aes(x=WorkLifeBalance,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. WorkLifeBalance")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. StockOptionLevel
employeeData %>%
ggplot(aes(x=StockOptionLevel,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. StockOptionLevel")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. TrainingTimesLastYear
employeeData %>%
ggplot(aes(x=TrainingTimesLastYear,fill=Attrition))+
geom_histogram()+
ggtitle("Attrition Vs. TrainingTimesLastYear")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. PerformanceRating
employeeData %>%
ggplot(aes(x=PerformanceRating,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. PerformanceRating")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. RelationshipSatisfaction
employeeData %>%
ggplot(aes(x=RelationshipSatisfaction,fill=Attrition))+
geom_histogram()+ggtitle("Attrition Vs. RelationshipSatisfaction")+
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Attrition Vs. Job Level
employeeData %>%
ggplot(aes(x=JobLevel,fill=Attrition))+
geom_bar()+ggtitle("Attrition Vs. Job Level")+
scale_y_continuous(labels = scales::percent)
### Attrition Vs. BusinessTravel
employeeData %>%
ggplot(aes(x=BusinessTravel,fill=Attrition))+
geom_bar()+ggtitle("Attrition Vs. BusinessTravel")+
scale_y_continuous(labels = scales::percent)
#### Monthly Income EDA
### Compare Incomes
employeeData %>% group_by(Attrition) %>% summarise(compareincomes=mean(MonthlyIncome))
## # A tibble: 2 × 2
## Attrition compareincomes
## <chr> <dbl>
## 1 No 6702
## 2 Yes 4765.
### Job Role vs. Monthly Salary
employeeData %>%
ggplot(aes(x=JobRole,y = MonthlyIncome, fill = JobRole))+
geom_boxplot()+ggtitle("Monthly Income vs. Job Role")+
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
ggpairs(employeeData[,c(3,30, 33:36)], aes(color = Attrition))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
employeeData %>%
ggplot(aes(x=TotalWorkingYears,y = MonthlyIncome))+
geom_point()+ggtitle("Monthly Income vs. TotalWorkingYears")+
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
employeeData %>%
ggplot(aes(x=BusinessTravel,y = MonthlyIncome))+
geom_boxplot()+ggtitle("Monthly Income vs. BusinessTravel")+
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
### KNN Model
## Classify
employeeData2 = employeeData
employeeData2$Attrition = as.factor(employeeData2$Attrition)
# create dataset for KNN model
model = employeeData2[,-c(1,4,5,6,8,9,10,11,12,13,14,17,19,21,23,24,28,38)]
# oversample to make up for imbalance in dataset
model = oversample(model,classAttr = "Attrition",method = "ADASYN")
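# (Added, illustrative) compare class balance before and after ADASYN:
# table(employeeData2$Attrition) shows the original ~7:1 imbalance, while
table(model$Attrition) # should now be roughly balanced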
## ML
set.seed(124) # Changed seed multiple times to see how high it can go
iterations = 200
numks = 20
splitPerc = .70
masterAcc = matrix(nrow = iterations, ncol = numks)
for(j in 1:iterations)
{
trainIndices = sample(1:dim(model)[1],round(splitPerc * dim(model)[1]))
train = model[trainIndices,]
test = model[-trainIndices,]
for(i in 1:numks)
{
classifications = knn(train[,-c(2)],test[,-c(2)],train$Attrition, prob = TRUE, k = i)
table(classifications,test$Attrition)
CM = confusionMatrix(table(classifications,test$Attrition))
masterAcc[j,i] = CM$overall[1]
}
}
MeanAcc = colMeans(masterAcc)
plot(seq(1,numks,1),MeanAcc, type = "l")
which.max(MeanAcc)
## [1] 1
max(MeanAcc)
## [1] 0.7058681
As we build and analyze our KNN model, we find that KNN is not the best model to use: its accuracy is poor. Our highest sensitivity is only slightly better at 0.6471, while our specificity reaches 0.7719. As a result, we will move on to a different model.
classifications = knn(train[,-c(2)],test[,-c(2)],train$Attrition, prob = TRUE, k = 3)
table(classifications,test$Attrition)
##
## classifications No Yes
## No 138 52
## Yes 79 163
CM = confusionMatrix(table(classifications,test$Attrition))
CM
## Confusion Matrix and Statistics
##
##
## classifications No Yes
## No 138 52
## Yes 79 163
##
## Accuracy : 0.6968
## 95% CI : (0.651, 0.7398)
## No Information Rate : 0.5023
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.3939
##
## Mcnemar's Test P-Value : 0.02311
##
## Sensitivity : 0.6359
## Specificity : 0.7581
## Pos Pred Value : 0.7263
## Neg Pred Value : 0.6736
## Prevalence : 0.5023
## Detection Rate : 0.3194
## Detection Prevalence : 0.4398
## Balanced Accuracy : 0.6970
##
## 'Positive' Class : No
##
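One caveat worth noting (an addition to the original analysis): knn() is distance-based, so predictors on large scales such as MonthlyIncome can dominate the distance computation. Standardizing the predictors sometimes helps; a minimal sketch, assuming the non-Attrition columns of the train/test split above are all numeric:

# Hypothetical variation: standardize predictors using the training statistics
train_scaled = scale(train[,-c(2)])
test_scaled = scale(test[,-c(2)],
                    center = attr(train_scaled, "scaled:center"),
                    scale = attr(train_scaled, "scaled:scale"))
classifications_scaled = knn(train_scaled, test_scaled, train$Attrition, k = 3)
confusionMatrix(table(classifications_scaled, test$Attrition))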
Next, we use Naive Bayes to build our model. This model yields much better results, with an accuracy of 0.8884, a specificity of 0.9019, and a sensitivity of 0.8750. We therefore find this model to be quite accurate, and we will use the Naive Bayes model we have built to predict attrition on the test dataset.
set.seed(25)
naive_data=employeeData
naive_data$Attrition = as.factor(naive_data$Attrition)
model2 = naive_data[,-c(1,2,4,5,7,9,10,11,13,
14,21:23,25,28,37,38)]
## get data ready for oversampling
#Convert all character variables to factors
model2 = model2 %>%
mutate_if(sapply(model2, is.character), as.factor)
#Convert all character variables to integer
model2 = model2 %>%
mutate_if(sapply(model2, is.factor), as.integer)
#oversample
model2 = oversample(model2,classAttr = "Attrition",method = "ADASYN")
#revert integer back to factor
model2 = model2 %>%
mutate_if(sapply(model2, is.integer), as.factor)
#change Attrition back to "No"/"Yes" labels and convert to factor
model2$Attrition <- ifelse(model2$Attrition==1,"No","Yes")
model2$Attrition <- as.factor(model2$Attrition)
trainIndices = sample(1:dim(model2)[1],round(.70 * dim(model2)[1]))
train = model2[trainIndices,]
test = model2[-trainIndices,]
# Note: the classifier is fit on all of model2, so the test rows evaluated
# below were also seen during training; fitting on `train` instead would
# give a stricter estimate of out-of-sample performance.
classifier1 = naiveBayes(Attrition~., data = model2)
pred = predict(classifier1,newdata=test)
CM = confusionMatrix(table(test$Attrition,pred))
CM
## Confusion Matrix and Statistics
##
## pred
## No Yes
## No 189 21
## Yes 27 193
##
## Accuracy : 0.8884
## 95% CI : (0.8547, 0.9165)
## No Information Rate : 0.5023
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7768
##
## Mcnemar's Test P-Value : 0.4705
##
## Sensitivity : 0.8750
## Specificity : 0.9019
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8773
## Prevalence : 0.5023
## Detection Rate : 0.4395
## Detection Prevalence : 0.4884
## Balanced Accuracy : 0.8884
##
## 'Positive' Class : No
##
employee_noattri = read.csv("CaseStudy2CompSet No Attrition (1).csv")
#employee_noattri
In this code block, we use our Naive Bayes model to predict attrition and store the predictions in a variable.
# Convert overtime to binary
employee_noattri$cleanOverTime = ifelse(employee_noattri$OverTime=="Yes",1,0)
employee_noattri = employee_noattri %>%
mutate_if(sapply(employee_noattri, is.character), as.factor)
#Convert all character variables to integer
employee_noattri = employee_noattri %>%
mutate_if(sapply(employee_noattri, is.factor), as.integer)
employee_noattri = employee_noattri %>%
mutate_if(sapply(employee_noattri, is.integer), as.factor)
attrition_pred = predict(classifier1, employee_noattri)
attrition_pred
## [1] No No Yes No No No No No No Yes No Yes No No No No No No
## [19] No No Yes No No No No No Yes Yes Yes Yes No No No No Yes No
## [37] No No No No No No No No Yes No No Yes Yes No No Yes No No
## [55] No No No No No No No No Yes No No No No No No No Yes No
## [73] No No No Yes No No No No No No Yes No No No No No No Yes
## [91] No No No Yes No No No No Yes Yes No No No No Yes No No No
## [109] No No No No No No No No No No Yes No No No No No No No
## [127] Yes No No No No No No Yes No No No Yes No Yes No No No No
## [145] No No Yes Yes No No No No No No No Yes No No Yes No Yes Yes
## [163] Yes No No No No No No No Yes No Yes Yes No No Yes Yes No No
## [181] Yes No No No No No No Yes No No No No No No Yes No No No
## [199] No No No No No No No No No No No No Yes No No No No No
## [217] Yes No No No No No No No No No Yes No No No Yes No No Yes
## [235] No No Yes No No No No No No No No No No No No Yes Yes No
## [253] No Yes No No No No No No No No No No No No No No Yes No
## [271] No Yes No Yes Yes No Yes No Yes Yes No No No No Yes Yes No Yes
## [289] No Yes No No Yes No Yes No No Yes No No
## Levels: No Yes
employee_noattri$AttritionPred = attrition_pred
#employee_noattri
filtered_noattrition = employee_noattri %>%
select(ID, AttritionPred) %>% arrange(ID)
head(filtered_noattrition)
## ID AttritionPred
## 1 1171 No
## 2 1172 No
## 3 1173 Yes
## 4 1174 No
## 5 1175 No
## 6 1176 No
write.csv(filtered_noattrition, "Case2PredictionsChang Attrition.csv")
employee_nosalary = read_excel("CaseStudy2CompSet No Salary (2).xlsx")
We will create a linear regression model that tests all of the variables to decide which ones to select for our final linear regression model. To incorporate all of the variables, I will create dummy columns that encode the categorical columns as binary indicators so they can be used in this model.
employeeData3 = employeeData2
employeeData3 = dummy_cols(employeeData3,
select_columns = c("BusinessTravel","Department",
"EducationField","Gender",
"JobRole", "MaritalStatus" ))
employeeData3$JobRole_Others = ifelse(employeeData3$JobRole_Manager == 1|employeeData3$`JobRole_Research Director` == 1|
employeeData3$`JobRole_Sales Executive` == 1, 0, 1)
employeeData3$BusinessTravel_Others = ifelse(employeeData3$`BusinessTravel_Non-Travel` == 1, 0, 1)
employeeData3[,c(39:66)] = employeeData3[,c(39:66)] %>%
mutate_if(sapply(employeeData3[,c(39:66)], is.numeric), as.factor)
lm_salarydf = employeeData3[,c(2,5,7,8,12,14,15,16,18, 20,
21,25,26,27,29,30, 31,32,33,34,
35,36,37, 39:64)]
lmsalary_model = lm(MonthlyIncome~.,
data = lm_salarydf)
summary(lmsalary_model)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3632.4 -659.6 2.9 620.1 4165.7
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.819e+02 5.470e+02 0.698 0.48531
## Age -1.510e+00 5.630e+00 -0.268 0.78867
## DailyRate 1.450e-01 9.122e-02 1.590 0.11222
## DistanceFromHome -6.536e+00 4.531e+00 -1.442 0.14956
## Education -3.259e+01 3.689e+01 -0.883 0.37735
## EnvironmentSatisfaction -6.510e+00 3.349e+01 -0.194 0.84592
## HourlyRate -3.380e-01 1.824e+00 -0.185 0.85307
## JobInvolvement 1.065e+01 5.236e+01 0.203 0.83894
## JobLevel 2.785e+03 8.341e+01 33.394 < 2e-16 ***
## JobSatisfaction 2.314e+01 3.295e+01 0.702 0.48276
## MonthlyRate -9.379e-03 5.141e-03 -1.824 0.06844 .
## PercentSalaryHike 2.532e+01 1.582e+01 1.600 0.10989
## PerformanceRating -3.237e+02 1.615e+02 -2.004 0.04537 *
## RelationshipSatisfaction 1.514e+01 3.309e+01 0.457 0.64749
## StockOptionLevel 4.146e+00 5.682e+01 0.073 0.94185
## TotalWorkingYears 5.204e+01 1.043e+01 4.990 7.37e-07 ***
## TrainingTimesLastYear 2.171e+01 2.904e+01 0.748 0.45497
## WorkLifeBalance -3.909e+01 5.139e+01 -0.761 0.44706
## YearsAtCompany -5.591e+00 1.321e+01 -0.423 0.67218
## YearsInCurrentRole 4.987e+00 1.699e+01 0.294 0.76916
## YearsSinceLastPromotion 3.160e+01 1.522e+01 2.077 0.03813 *
## YearsWithCurrManager -2.652e+01 1.666e+01 -1.591 0.11190
## cleanOverTime 2.343e+00 8.080e+01 0.029 0.97687
## `BusinessTravel_Non-Travel`1 -3.787e+02 1.198e+02 -3.160 0.00163 **
## BusinessTravel_Travel_Frequently1 -1.773e+02 9.700e+01 -1.827 0.06801 .
## BusinessTravel_Travel_Rarely1 NA NA NA NA
## `Department_Human Resources`1 4.256e+02 4.871e+02 0.874 0.38258
## `Department_Research & Development`1 5.625e+02 3.307e+02 1.701 0.08936 .
## Department_Sales1 NA NA NA NA
## `EducationField_Human Resources`1 -7.526e+01 3.841e+02 -0.196 0.84472
## `EducationField_Life Sciences`1 3.728e+01 1.366e+02 0.273 0.78490
## EducationField_Marketing1 9.674e+00 1.794e+02 0.054 0.95700
## EducationField_Medical1 -7.245e+01 1.416e+02 -0.512 0.60892
## EducationField_Other1 -1.773e+01 1.951e+02 -0.091 0.92760
## `EducationField_Technical Degree`1 NA NA NA NA
## Gender_Female1 -1.125e+02 7.442e+01 -1.512 0.13097
## Gender_Male1 NA NA NA NA
## `JobRole_Healthcare Representative`1 -9.286e+01 3.907e+02 -0.238 0.81219
## `JobRole_Human Resources`1 -2.787e+02 5.158e+02 -0.540 0.58919
## `JobRole_Laboratory Technician`1 -6.875e+02 3.705e+02 -1.855 0.06388 .
## JobRole_Manager1 4.186e+03 3.565e+02 11.744 < 2e-16 ***
## `JobRole_Manufacturing Director`1 7.693e+01 3.884e+02 0.198 0.84304
## `JobRole_Research Director`1 3.960e+03 4.352e+02 9.098 < 2e-16 ***
## `JobRole_Research Scientist`1 -4.370e+02 3.699e+02 -1.181 0.23775
## `JobRole_Sales Executive`1 4.251e+02 1.904e+02 2.233 0.02583 *
## `JobRole_Sales Representative`1 NA NA NA NA
## MaritalStatus_Divorced1 -2.900e+01 1.343e+02 -0.216 0.82905
## MaritalStatus_Married1 4.348e+01 1.023e+02 0.425 0.67089
## MaritalStatus_Single1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1056 on 827 degrees of freedom
## Multiple R-squared: 0.9498, Adjusted R-squared: 0.9472
## F-statistic: 372.3 on 42 and 827 DF, p-value: < 2.2e-16
Looking at the linear regression model created above, we select all of the columns whose p-values are significant and add them to our final linear regression model to predict salary. Please note that columns 39-44 contain all of the Business Travel levels, while columns 53-61 contain all of the Job Role levels.
lm_salarydf = employeeData3[,c(16, 20,21, 26, 30, 35,39:44, 53:61)] # 39-44 is Business Travel, 53-61 is Job Role
lmsalary_model2 = lm(MonthlyIncome~.,
data = lm_salarydf)
summary(lmsalary_model2)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = lm_salarydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3777.9 -633.7 -4.8 619.2 4107.0
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.448e+02 3.627e+02 0.399 0.689770
## JobLevel 2.779e+03 8.153e+01 34.086 < 2e-16 ***
## MonthlyRate -9.052e-03 5.056e-03 -1.790 0.073748 .
## PerformanceRating -1.379e+02 1.004e+02 -1.373 0.170137
## TotalWorkingYears 4.499e+01 8.342e+00 5.393 8.99e-08 ***
## YearsSinceLastPromotion 1.564e+01 1.275e+01 1.227 0.220322
## `BusinessTravel_Non-Travel`1 -3.983e+02 1.173e+02 -3.395 0.000718 ***
## BusinessTravel_Travel_Frequently1 -1.972e+02 9.512e+01 -2.073 0.038450 *
## BusinessTravel_Travel_Rarely1 NA NA NA NA
## `Department_Human Resources`1 3.978e+02 4.425e+02 0.899 0.368936
## `Department_Research & Development`1 5.527e+02 3.232e+02 1.710 0.087578 .
## Department_Sales1 NA NA NA NA
## `JobRole_Healthcare Representative`1 -1.113e+02 3.865e+02 -0.288 0.773496
## `JobRole_Human Resources`1 -3.424e+02 5.078e+02 -0.674 0.500233
## `JobRole_Laboratory Technician`1 -7.331e+02 3.664e+02 -2.001 0.045706 *
## JobRole_Manager1 4.156e+03 3.511e+02 11.837 < 2e-16 ***
## `JobRole_Manufacturing Director`1 1.860e+01 3.840e+02 0.048 0.961381
## `JobRole_Research Director`1 3.921e+03 4.295e+02 9.129 < 2e-16 ***
## `JobRole_Research Scientist`1 -4.694e+02 3.655e+02 -1.284 0.199417
## `JobRole_Sales Executive`1 3.681e+02 1.852e+02 1.988 0.047168 *
## `JobRole_Sales Representative`1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1053 on 852 degrees of freedom
## Multiple R-squared: 0.9485, Adjusted R-squared: 0.9475
## F-statistic: 923.9 on 17 and 852 DF, p-value: < 2.2e-16
RSS = c(crossprod(lmsalary_model2$residuals))
MSE = RSS / length(lmsalary_model2$residuals)
RMSE = sqrt(MSE)
sig2 = RSS / lmsalary_model2$df.residual
RMSE
## [1] 1042.313
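Note that this RMSE is computed on the same rows the model was fit to, so it is an in-sample estimate. A held-out estimate (an illustrative addition, using a hypothetical 70/30 split of the same data frame) could be computed as:

set.seed(25) # hypothetical split, for illustration only
idx = sample(1:dim(lm_salarydf)[1], round(.70 * dim(lm_salarydf)[1]))
holdout_fit = lm(MonthlyIncome~., data = lm_salarydf[idx,])
holdout_pred = predict(holdout_fit, lm_salarydf[-idx,])
sqrt(mean((lm_salarydf[-idx,]$MonthlyIncome - holdout_pred)^2))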
# create dummy variables
employee_nosalary= dummy_cols(employee_nosalary,
select_columns = c("BusinessTravel","Department",
"EducationField","Gender",
"JobRole", "MaritalStatus" ))
employee_nosalary$cleanOverTime = ifelse(employee_nosalary$OverTime=="Yes",1,0)
employee_nosalary$JobRole_Others = ifelse(employee_nosalary$JobRole_Manager == 1|
employee_nosalary$`JobRole_Research Director` ==1|
employee_nosalary$`JobRole_Sales Executive` == 1, 0, 1)
employee_nosalary$BusinessTravel_Others = ifelse(employee_nosalary$`BusinessTravel_Non-Travel` == 1, 0, 1)
# change dummy variables to factors
employee_nosalary[,c(36:63)] = employee_nosalary[,c(36:63)] %>%
mutate_if(sapply(employee_nosalary[,c(36:63)], is.numeric), as.factor)
# predict salary
salary_pred = predict(lmsalary_model2, employee_nosalary)
## Warning in predict.lm(lmsalary_model2, employee_nosalary): prediction from a
## rank-deficient fit may be misleading
#salary_pred = unlist(salary_pred)
employee_nosalary$SalaryPred =salary_pred
head(employee_nosalary)
## # A tibble: 6 × 65
## ID Age Attrition Busine…¹ Daily…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 871 43 No Travel_… 1422 Sales 2 4 Life S… 1
## 2 872 33 No Travel_… 461 Resear… 13 1 Life S… 1
## 3 873 55 Yes Travel_… 267 Sales 13 4 Market… 1
## 4 874 36 No Non-Tra… 1351 Resear… 9 4 Life S… 1
## 5 875 27 No Travel_… 1302 Resear… 19 3 Other 1
## 6 876 39 Yes Travel_… 895 Sales 5 3 Techni… 1
## # … with 55 more variables: EmployeeNumber <dbl>,
## # EnvironmentSatisfaction <dbl>, Gender <chr>, HourlyRate <dbl>,
## # JobInvolvement <dbl>, JobLevel <dbl>, JobRole <chr>, JobSatisfaction <dbl>,
## # MaritalStatus <chr>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # Over18 <chr>, OverTime <chr>, PercentSalaryHike <dbl>,
## # PerformanceRating <dbl>, RelationshipSatisfaction <dbl>,
## # StandardHours <dbl>, StockOptionLevel <dbl>, TotalWorkingYears <dbl>, …
## # ℹ Use `colnames()` to see all variable names
filtered_nosalary = employee_nosalary %>% select(ID, SalaryPred) %>% arrange(ID)
head(filtered_nosalary)
## # A tibble: 6 × 2
## ID SalaryPred
## <dbl> <dbl>
## 1 871 5572.
## 2 872 2662.
## 3 873 12258.
## 4 874 1935.
## 5 875 2497.
## 6 876 6114.
write.csv(filtered_nosalary, "Case2PredictionsChang Salary.csv")